METHOD OF, AND SYSTEM FOR, SCANNING ELECTRONIC DOCUMENTS
WHICH CONTAIN LINKS TO EXTERNAL OBJECTS 

Introduction 

The present invention relates to a method of, and system for, replacing external
links in electronic documents such as email with internal links. One use of this is to ensure
that email that attempts to bypass email content scanners no longer succeeds. Another use is
to reduce the effectiveness of web bugs.

Background

Content scanning can be carried out at a number of places in the passage of 
electronic documents from one system to another. Taking email as an example, it may be 
carried out by software operated by the user, e.g. incorporated in or an adjunct to, his email 
client, and it may be carried out on a mail server to which the user connects, over a LAN or
WAN, in order to retrieve email. Also, Internet Service Providers (ISPs) can carry out
content scanning as a value-added service on behalf of customers who, for example, then 
retrieve their content-scanned email via a POP3 account or similar. 

One trick which can be used to bypass email content scanners is to create an
email which just contains a link (such as an HTML hyperlink) to the undesirable or "nasty"
content. Such content may include viruses and other varieties of malware as well as potentially
offensive material such as pornographic images and text, spam and other material to which
the email recipient may not wish to be subjected. The content scanner sees only the link,
which is not suspicious, and the email is let through. However, when the email is viewed in
the email client, the object referred to may be brought in either automatically by the email
client or when the reader clicks on the link. Thus, the nasty object ends up on the user's
desktop, without ever passing through the email content scanner.
It is possible for the content scanner to download the object by following the
link itself. It can then scan the object. However, this method is not foolproof: for instance,
the server delivering the object to the content scanner may be able to detect that the request is
from a content scanner and not from the end user. It may then serve up a different, innocent
object to be scanned. However, when the end user requests the object, they get the nasty one.



Summary of the Invention 

The present invention seeks to reduce or eliminate the problems of embedded
links in electronic documents and does so by having the content scanner attempt to follow a
link found in an electronic document and scan the object which is the target of the link. If the
object is found to be acceptable from the point of view of content-scanning criteria, it is
retrieved by the scanner and embedded in the electronic document and the link in the
electronic document is adjusted to point at the embedded object rather than the original; this
can then be delivered to the recipient without the possibility that the version received by the
recipient differs from the one originally scanned.

If the object is not found to be acceptable, one or more remedial actions may be
taken: for example, the link may be replaced by a non-functional link and/or a notice that the
original link has been removed and why; another possibility is that the electronic document
can be quarantined and an email or alert generated and sent to the intended recipient advising
him that this has been done and perhaps including a link via which he can retrieve it
nevertheless or delete it. The process of following links, scanning the linked object and
replacing it or not with an embedded copy and an adjusted link may be applied recursively.
An upper limit may be placed on the number of recursion levels, to stop the system getting
stuck in an infinite loop (e.g. because there are circular links) and to effectively limit the
amount of time the processing will take.



WO 2004/017238 PCT/GB2003/003475 


Thus according to the present invention there is provided a content scanning 
system for electronic documents such as emails comprising: 

a) a link analyser for identifying hyperlinks in document content; 

b) means for causing a content scanner to scan objects referenced by links
identified by the link analyser and to determine their acceptability according to predefined
rules, the means being operative, when the link is to an object external to the document and is
determined by the content analyser to be acceptable, to retrieve the external object and modify
the document by

b1. embedding in it or attaching to it the retrieved copy of the object; and

b2. replacing the link to the external object by one to the copy embedded
in, or attached to, the document.

The invention also provides a method of content-scanning electronic 
documents such as emails comprising: 

a) using a link analyser for identifying hyperlinks in document content;

b) using a content scanner to scan objects referenced by links identified by
the link analyser and to determine their acceptability according to predefined rules, the means
being operative, when the link is to an object external to the document and is determined by
the content analyser to be acceptable, to retrieve the external object and modify the document
by

b1. embedding in it or attaching to it the retrieved copy of the object; and

b2. replacing the link to the external object by one to the copy embedded
in, or attached to, the document.

Thus the content scanner can follow the link, and download and scan the
object. If the object is judged satisfactory, the object can then be embedded in the email, and
the link to the external object replaced by a link to the object now embedded in the email.
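
As an illustration only, a link analyser of the kind referred to above might be sketched as follows. The sketch assumes that a link is "external" when it appears in a src or href attribute and uses the http or https scheme; the names LinkAnalyser and find_external_links are hypothetical and do not come from the specification.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urlparse


class LinkAnalyser(HTMLParser):
    """Collects src/href values that point outside the document (assumed rule)."""

    def __init__(self) -> None:
        super().__init__()
        self.external_links: List[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before calling this.
        for name, value in attrs:
            if name in ("src", "href") and value:
                # Assumption: only http(s) URLs count as external;
                # cid: references already point inside the email.
                if urlparse(value).scheme in ("http", "https"):
                    self.external_links.append(value)


def find_external_links(html: str) -> List[str]:
    """Return the external links found in an HTML body, in document order."""
    analyser = LinkAnalyser()
    analyser.feed(html)
    analyser.close()
    return analyser.external_links
```

A link of the form cid:EXTERNAL is treated as internal by this sketch, so an email that has already been rebuilt would yield no external links.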
One trick used by spammers is to embody 'web bugs' in their spam emails.
These are unique or semi-unique links to web sites - so a spammer sending out 1000 emails
would use 1000 different links. When the email is read, a connection is made to the web site,
and by finding which link has been hit, the spammer can match it with their records to tell
which person has read the spam email. This then confirms that the email address is a genuine
one. The spammer can continue to send email to that address, or perhaps even sell the address
on to other spammers.

By following every external link in every email that passes through the content
scanner, all the web bugs the spammer sends out will be activated. Their effectiveness
therefore becomes much reduced, because they can no longer be used to tell which email
addresses were valid or not.

The invention will be further described by way of non-limiting example with
reference to the accompanying drawings, in which:

Figure 1 shows the "before" and "after" states of an email processed by an
embodiment of the present invention; and

Figure 2 shows a system embodying the present invention.

Figure 1a shows an email 1 which comprises a header region 2 and a body 3
formatted according to an internet (e.g. SMTP/MIME) format. The body 3 includes a
hypertext link 4 which points to an object 5 on a web server 6 somewhere on the internet.
The object 5 may for example be a graphical image embedded in a web page (e.g. HTML or
XHTML);

Figure 1b shows the email 1 after processing by the illustrated embodiment of
the invention and it will be seen that the object 5 has been appended to the email (e.g. as a
MIME attachment) as item 5* and the link 4 has been adjusted so that it now points to this
version of the object rather than the one held on the external server 6; and
Figure 2 is an illustration of a system 10 according to the present invention,
which may be implemented as a software automaton. Although the invention is not limited to
this application, this example embodiment is given in terms of a content scanner operated by
an ISP to process an email stream e.g. passing through an email gateway.


Operation of Embodiment 

1) The email is analysed by analyser 20 to determine whether it contains 
external links. If none are found, omit steps 2 to 5. 

2) For each external link, the external object is obtained by object fetcher
30 from the internet. If the object cannot be obtained, go to step 7.

3) The external objects are scanned by analyser 40 for pornography,
viruses, spam and other undesirables. If any are found, go to step 7.

4) The external objects are analysed to see whether they contain external
links. If the nesting limit has been reached, go to step 7. Otherwise go to step 2 for each
external link.

5) The email is now rebuilt by email rebuilder 50. In the case of MIME
email, the external links are replaced with internal links, and the objects obtained are added to
the email as MIME sections. Non-MIME email is first converted to MIME email, and the
process then continues as before.

6) The email is sent on, and processing stops in respect of that email.

7) An undesirable object has been found, or the object could not be
retrieved, or the nesting limit has been reached. We may wish to block the email (processing
stops), or to remove the links. We may also want to send warning messages to sender and
recipient if the email has been blocked. Meanwhile the email may be held in quarantine as
indicated at 60, which may be implemented as a reserved file directory.
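
The control flow of steps 1 to 4 and 7 can be sketched as below. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the fetcher, the scanner and the link analyser are passed in as hypothetical callables (fetch, is_clean, find_links), the value of NESTING_LIMIT is invented for the example, and the result is reduced to a verdict string plus the objects to be embedded. Rebuilding (step 5) and sending (step 6) are left to the caller.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

NESTING_LIMIT = 3  # hypothetical value; the text does not specify one


class Step7(Exception):
    """Signals step 7: retrieval failed, object undesirable, or limit reached."""


def gather(links: Iterable[str],
           fetch: Callable[[str], Optional[bytes]],
           is_clean: Callable[[bytes], bool],
           find_links: Callable[[bytes], Iterable[str]],
           depth: int,
           found: Dict[str, bytes]) -> None:
    """Steps 2 to 4, applied recursively to every external link."""
    for url in links:
        if depth >= NESTING_LIMIT:           # step 4: nesting limit reached
            raise Step7("nesting limit reached at " + url)
        obj = fetch(url)                     # step 2: obtain the external object
        if obj is None:
            raise Step7("could not retrieve " + url)
        if not is_clean(obj):                # step 3: scan the object
            raise Step7("undesirable object at " + url)
        found[url] = obj
        # step 4: look for further external links inside the object
        gather(find_links(obj), fetch, is_clean, find_links, depth + 1, found)


def process_email(body: bytes,
                  fetch: Callable[[str], Optional[bytes]],
                  is_clean: Callable[[bytes], bool],
                  find_links: Callable[[bytes], Iterable[str]],
                  ) -> Tuple[str, Dict[str, bytes]]:
    """Return ("send", objects) or ("quarantine", {}) for one email body."""
    found: Dict[str, bytes] = {}
    try:
        # step 1: if the body has no external links, the loop does nothing
        gather(find_links(body), fetch, is_clean, find_links, 0, found)
    except Step7:
        return "quarantine", {}              # step 7
    return "send", found                     # caller performs steps 5 and 6
```

In this sketch a circular chain of links is not detected directly; it is simply followed until the nesting limit is reached, at which point the email is routed to step 7.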
Example



The following email contains a link to a website. 



Subject: email with link 
Subject: 

Date: Thu, 9 May 2002 16:17:01 +0600 
MIME-Version: 1.0
Content-Type: text/html;
Content-Transfer-Encoding: 7bit

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
</HEAD>
<BODY bgColor=3D#ffffff>
<DIV> </DIV>
This is some text<BR>
<DIV><IMAGE src="http://www.messagelabs.com/images/global/nav/box-images/virus-eye-light.gif">
</DIV>
This is some more text<BR>
</BODY></HTML>

The binary content of 'http://www.messagelabs.com/images/global/nav/box-images/virus-eye-light.gif'
is as follows:
This file can be downloaded, scanned, and if acceptable, a new email can be created with the
image embedded in the email:



Subject: email with link 
Subject: 

Date: Thu, 9 May 2002 16:17:01 +0600
MIME-Version: 1.0
Content-Type: multipart/related;
        boundary="ABCD";
Content-Transfer-Encoding: 7bit

--ABCD
Content-Type: text/html;
Content-Transfer-Encoding: 7bit

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
</HEAD>
<BODY bgColor=3D#ffffff>
<DIV> </DIV>
This is some text<BR>
<DIV><IMAGE src=cid:EXTERNAL>
</DIV>
This is some more text<BR>
</BODY></HTML>

--ABCD
Content-ID: <EXTERNAL>
Content-Type: image/gif;
        name="image001.gif"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
        filename="image001.gif"

R01GODlhFwAXAMQAAICAgGRWBA7^FJSU8irBP39/f/YAKqQBP/OAAshVzQrA8bGx7Cuq412BRYV 
FxgUAiYnLMaxTL2+w+/LAyQdAgMLHgAqhEc8AwwKBZONcqGfl+Lh4dvh9ti6A//tAeS+HiH5BAAA 
AAAALAAAAAAXABcAAAX2ICCOZGmOSKqu6mQg74uI7PoSTXNOBP/SNcOkcwg8FIpirgdcDTtIjEDw
uCAbgUMzZSBcGtOwn01IJiAzoV+ciboelogPW+nDbpyLpXeDQtx9xDwEUbwMZCxoOU3pJHSIBAWB8 
ABIFBRIDAwIJAwAYFAEEjxllEAuWlgMQmhYJG5oCFyITYBinqAWwFaOMEFOAABOEA7iWDA4 JFhYV 
DoJlcVMZxZYIAxW/Bx5DImwCDLgbEhmq7\hgBHTIGIsMLAJkQf Z91BGgrccQZkJA6BC71LGcicIiQ 
pmANewAQfNgQ4aDDFDQ+FNCA7mGNiJcCyLCo4oTHjy£AADs= 




--ABCD--
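
For illustration, a rebuilt message of the above shape could be produced with Python's standard email package along the following lines. This is a sketch under assumptions: the function name rebuild_email and its parameters are hypothetical, the link is rewritten by plain string replacement rather than by parsing the HTML, and the attachment is assumed to be a GIF named image001.gif as in the example above.

```python
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def rebuild_email(subject: str, html: str, url: str, image: bytes,
                  cid: str = "EXTERNAL") -> MIMEMultipart:
    """Embed a fetched image and point the original link at the embedded copy."""
    msg = MIMEMultipart("related")
    msg["Subject"] = subject

    # Replace the external link with an internal cid: reference.
    # Simplification: plain string replacement, not an HTML-aware rewrite.
    msg.attach(MIMEText(html.replace(url, "cid:" + cid), "html"))

    # Attach the scanned object as a further MIME section (base64 by default).
    part = MIMEImage(image, _subtype="gif")
    part.add_header("Content-ID", "<" + cid + ">")
    part.add_header("Content-Disposition", "attachment", filename="image001.gif")
    msg.attach(part)
    return msg
```

Calling msg.as_string() on the result yields a multipart/related message whose HTML part refers to cid:EXTERNAL and whose second part carries the base64-encoded image; the boundary string is generated by the library rather than fixed as "ABCD".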
CLAIMS

1. A content scanning system for electronic documents such as emails
comprising:

a) a link analyser for identifying hyperlinks in document content;

b) means for causing a content scanner to scan objects referenced by links
identified by the link analyser and to determine their acceptability according to predefined
rules, the means being operative, when the link is to an object external to the document and is
determined by the content analyser to be acceptable, to retrieve the external object and modify
the document by

b1. embedding in it or attaching to it the retrieved copy of the object; and
b2. replacing the link to the external object by one to the copy embedded in,
or attached to, the document.

2. A system according to claim 1 wherein the link analyser a) and means b) are
operative to recursively process links identified in such external objects.

3. A system according to claim 2 in which only a maximum depth of recursion is 
permitted and the document is flagged as unacceptable if that limit is reached. 


4. A system according to claim 1, 2 or 3 wherein acceptable retrieved objects are 
encoded into MIME format. 
5. A system according to any one of the preceding claims wherein if any linked-to
object is determined by the content scanner to be unacceptable the document is flagged or
modified to indicate that fact.

6. A method of content-scanning electronic documents such as emails
comprising:

a) using a link analyser for identifying hyperlinks in document content;

b) using a content scanner to scan objects referenced by links identified by
the link analyser and to determine their acceptability according to predefined rules, the means
being operative, when the link is to an object external to the document and is determined by
the content analyser to be acceptable, to retrieve the external object and modify the document
by

b1. embedding in it or attaching to it the retrieved copy of the object; and
b2. replacing the link to the external object by one to the copy embedded in,
or attached to, the document.

7. A method according to claim 6 wherein the steps a) and b) are used recursively
to process links identified in such external objects.

8. A method according to claim 7 in which only a maximum depth of recursion is
permitted and the document is flagged as unacceptable if that limit is reached.

9. A method according to claim 6, 7 or 8 wherein acceptable retrieved objects are
encoded into MIME format.



10. A method according to any one of claims 6 to 9, wherein if any linked-to object
is determined by the content scanner to be unacceptable the document is flagged or modified
to indicate that fact.

11. A content scanning system for electronic documents substantially as hereinbefore
described and with reference to the accompanying drawings.

12. A method of content-scanning electronic documents substantially as hereinbefore 
described and with reference to the accompanying drawings. 


